[Figure: The branching program produced by the boosting algorithm. Each node $v_{i,t}$ is
labeled with a 0/1-valued function $h_{i,t}$; left edges correspond to 0 and right edges to 1.]
SYSTEMS AND METHODS FOR MARTINGALE BOOSTING IN MACHINE LEARNING

SPECIFICATION

CROSS-REFERENCE TO RELATED APPLICATIONS 

This application claims the benefit of United States provisional 
application Serial No. 60/716,615 filed September 13, 2005, which is hereby 
incorporated by reference herein in its entirety. 

FIELD OF THE INVENTION

The present invention relates to systems and methods for machine learning.

BACKGROUND OF THE INVENTION 
Computational learning, or machine learning, concerns computer programs or algorithms
that automatically improve their performance through experience over time. Machine
learning algorithms can be exploited for automatic performance improvement through
learning in many fields including, for example, planning and scheduling, bio-informatics,
natural language processing, information retrieval, speech processing, behavior
prediction, and face and handwriting recognition.

An approach to developing useful machine learning algorithms is
based on statistical modeling of data. With a statistical model in hand, probability
theory and decision theory can be used to develop machine learning algorithms.
Statistical models that are commonly used for developing machine learning
algorithms may include, for example, regression, neural network, linear classifier,
support vector machine, Markov chain, and decision tree models. This statistical
approach may be contrasted to other approaches in which training data is used merely
to select among different algorithms or to approaches in which heuristics or common
sense is used to design an algorithm.

In mathematical terms, a goal of machine learning is to be able to
predict the value of a random variable y from a measurement x (e.g., predicting the
value of engine efficiency based on a measurement of oil pressure in an engine). The
machine learning processes may involve statistical data resampling techniques or
procedures such as bootstrapping, bagging, and boosting, which allow extraction of
additional information from a training data set.

The technique of bootstrapping was originally developed in statistical 
data analysis to help determine how much the results extracted from a training data set 
might have changed if another random sample had been used instead, or how different 
the results might be when a model is applied to new data. In bootstrapping, 
resampling is used to generate multiple versions of the training data set (replications). 
A separate analysis is conducted for each replication, and then the results are
averaged. If the separate analyses differ considerably from each other, suggesting, for
example, decision tree instability, the averaging will stabilize the results and yield
predictions that are more accurate. In bootstrap aggregation (or bagging) procedures,
each new resample is drawn in the identical way. In boosting procedures, the way a 
resample is drawn for the next tree depends on the performance of prior trees. 

Although boosting procedures may theoretically yield significant
reduction in predictive error, they perform poorly when error or noise exists in the
training data set. The poor performance of boosting procedures is often a result of
over-fitting the training data set, since the later resampled training sets can
over-emphasize examples that are noise. Further, recent attempts to provide noise-tolerant
boosting algorithms fail to provide acceptable solutions for practical or realistic data
situations, for example, because their methods for updating probabilities can
over-emphasize noisy data examples. Accordingly, a need exists for a boosting procedure
having good predictive characteristics even when applied to practical noisy data sets.

Consideration is now being given to improving prior art systems and
methods for machine learning. Attention is particularly directed to improving
boosting procedures. Desirable boosting procedures are noise-tolerant in realistic or
practical data situations.

SUMMARY OF THE INVENTION 

Systems and methods are provided for machine learning in the 
presence of noise. 

In an exemplary embodiment, a machine learning method having 
multiple learning stages is provided. Each learning stage may include partitioning 
examples into bins, choosing a base classifier for each bin, and assigning an example
to a bin by counting the number of positive predictions previously made by the base 
classifier associated with the particular bin. 

BRIEF DESCRIPTION OF THE DRAWINGS

Further features of the invention, its nature, and various advantages
will be more apparent from the following detailed description and the accompanying
drawing in which:

FIG. 1 is a schematic illustration of a machine learning branching
program produced by a martingale boosting algorithm in accordance with the
principles of the present invention.

FIG. 2 illustrates a machine learning process for ranking feeders in an
electrical power distribution system in order of their predicted likelihood of failure, in
accordance with the principles of the present invention.

DETAILED DESCRIPTION 

Machine learning systems and methods are provided. The systems and
methods are based on noise-tolerant boosting algorithms. The systems and methods
use boosting techniques that can achieve high accuracy in the presence of
misclassification noise. The boosting algorithms (referred to herein as "martingale"
boosting algorithms) are designed to reweight data examples so that error rates are
balanced or nearly balanced at each successive learning stage. The error rates are
balanced or nearly balanced in a manner that preserves noise tolerance.

A machine learning system for automated learning using martingale
boosting combines simple predictors into more sophisticated aggregate predictors.
Learning proceeds in stages. At each stage, the algorithm partitions training data
examples into bins. A bin consists of examples that are regarded as roughly
equivalent by the simple predictors chosen in earlier stages. The boosting algorithm
chooses a simple model for each bin. The simple models are chosen so as to ensure
nontrivial accuracy on examples in the bins for each of several types of bins.

An embodiment of the martingale boosting technique ranks items or
objects in order of the likelihood that they have a particular property, behavior or
characteristic. This embodiment has been applied to order power distribution cables
(i.e., feeders) in an electrical power distribution system by how likely they are to fail.
A machine learning system used for predicting the failure of feeders in
an electrical power distribution system includes a boosting algorithm. Past feeder
failure events are known and the feeders are associated with a plurality of scores that
are predictive of feeder failure. The algorithm processes a list of feeders and the
associated plurality of scores in a number of successive learning stages. At each
learning stage, the list of feeders is partitioned into a number of sublists so that the
past feeder failure events are distributed substantially evenly across the number of
sublists. For each sublist, a predictive score is chosen from the plurality of predictive
scores associated with the feeders in the sublist. Next, the feeders in the sublist are
ranked according to the chosen predictive score. Then, the sublists are recombined to
generate a list in which the feeders are ranked according to the predictive scores
chosen for the respective sublists.

An example of the martingale boosting technique concerns the
prediction of binary classifications (i.e., 0 and 1). Here, the simple predictors are
simple binary classifiers ("base classifiers"). Learning proceeds incrementally in
stages. At each stage, data examples are partitioned into bins, and a separate base
classifier is chosen for each bin. A data example is assigned to a bin by counting the
number of positive (i.e., "1") predictions made by the appropriate base classifiers
from earlier learning stages or iterations. Preferred embodiments of the boosting
techniques are designed to classify an object by a random walk on the number of base
classifiers that make positive predictions. When the error rates are balanced between
false positives and false negatives, and are slightly better than random guessing, more
than half the algorithmic learning steps are in the correct direction (i.e., the data
examples are classified correctly by the boosted classifier).

Certain embodiments of the martingale boosting algorithms achieve
noise tolerance by virtue of the fact that, by design, the probability of a data example
reaching a given bin depends on the predictions made by the earlier base classifiers,
and not on the label of the data example. In particular, the probability of a data
example reaching a given bin, unlike the case in prior art boosting algorithms such as
"Boost-by-Majority" algorithms, does not depend on the number of predictions that
are correct or incorrect.

Certain embodiments of the martingale boosting algorithms also make
it possible to force a standard weak learner to produce a classifier with balanced error
rates in appropriate situations. For example, if decision tree stumps are used as the
base classifiers, the threshold of the stump may be chosen to balance the error rates on
positive and negative examples. In some embodiments of the inventive martingale
boosting algorithms, the balanced error rates may be promoted directly, for example,
by using decision stumps as base classifiers ("martingale ranking algorithms"). Such
embodiments allow easy adjustment of the threshold required to balance the error
rates on the training data.

The general architecture or framework of a martingale boosting
algorithm is described herein with reference to FIG. 1, which shows the graph
structure of a machine learning branching program produced by the martingale
boosting algorithm. In the figure, each node $v_{i,t}$ of the branching program is labeled
with a binary-valued function $h_{i,t}$ having values 0 or 1. At each node shown in the
figure, the left edges correspond to 0 and the right edges to 1.

As an aid in understanding the martingale boosting algorithms, it is
useful at this stage in the description to consider the training data examples as being
generated from a probability distribution. Further, it is useful to introduce the
following notation: $X$ is the set of items to be classified, and $c : X \to \{0,1\}$ is the
target concept, which assigns the correct classification to each item. The distribution
over $X$ generating the data is called $D$. $D^+$ denotes the distribution $D$ restricted to the
positive examples $\{x \in X : c(x) = 1\}$. Thus, for any event $S \subseteq \{x \in X : c(x) = 1\}$:

$$\Pr_{D^+}[x \in S] = \Pr_{D}[x \in S] \,/\, \Pr_{D}[c(x) = 1]. \tag{1}$$

Similarly, $D^-$ denotes $D$ restricted to the negative examples $\{x \in X : c(x) = 0\}$.

The boosting algorithm shown in FIG. 1 works in a series of $T$ stages.
The hypothesis of the boosting algorithm is a layered branching program with $T + 1$
layers in a grid graph structure, where layer $t$ has $t$ nodes (see FIG. 1). The $i$-th node
from the left in layer $t$ is referred to and labeled as $v_{i,t}$, where $i$ ranges from 0 to
$t-1$. For $1 \le t \le T$, each node $v_{i,t}$ in layer $t$ has two outgoing edges: a left edge to
node $v_{i,t+1}$, and a right edge to node $v_{i+1,t+1}$. In FIG. 1 the left and right edges are
labeled 0 and 1, respectively. Nodes $v_{i,T+1}$ in layer $T+1$ have no outgoing edges.

Before stage $t$ of the boosting algorithm begins, each node $v_{i,j}$ at levels
$j = 1, \ldots, t-1$ is labeled with a 0/1-valued hypothesis function $h_{i,j}$. In the $t$-th stage,
hypothesis functions are assigned to each of the $t$ nodes $v_{0,t}$ through $v_{t-1,t}$ at level $t$.
Given an example $x \in X$ in stage $t$, the branching program routes the example by
evaluating $h_{0,1}$ on $x$ and then sending the example on the outgoing edge whose label
is $h_{0,1}(x)$, i.e., sending it to node $v_{h_{0,1}(x),2}$. The example is then routed through
successive levels in this way until it reaches level $t$. In other words, when example $x$
reaches some node $v_{i,j}$ in level $j$, it is routed from there via the outgoing edge whose
label is $h_{i,j}(x)$ to the node $v_{i+h_{i,j}(x),\,j+1}$. In this fashion, the example $x$ eventually
reaches the node $v_{\ell,t}$ after being evaluated on $(t-1)$ hypotheses, where $\ell$ is the number
of these $(t-1)$ hypotheses that evaluated to 1 on $x$.

Thus, in the $t$-th stage of boosting, given an initial distribution $D$ over
examples $x$, the hypotheses that have been assigned to nodes at levels $1, \ldots, t-1$ of the
branching program induce $t$ different distributions $D_{0,t}, \ldots, D_{t-1,t}$ corresponding to the
$t$ nodes $v_{0,t}, \ldots, v_{t-1,t}$ in layer $t$. It will be understood that a random draw $x$ from
distribution $D_{i,t}$ is a draw from $D$ conditioned on $x$ reaching $v_{i,t}$.

Once all $T$ stages of boosting have been performed, the resulting
branching program routes any example $x$ to some node $v_{\ell,T+1}$ at level $T + 1$. Here $\ell$
denotes the number of hypotheses that evaluated to 1 out of the $T$ hypotheses that
were evaluated on $x$. The final classifier computed by the branching program is
simple: given an example $x$ to classify, if the final node $v_{\ell,T+1}$ that $x$ reaches has
$\ell \ge T/2$, then the output is 1; otherwise the output is 0.
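
The routing rule and the final vote described above can be sketched in Python. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the names `route` and `classify`, the dictionary keyed by `(i, t)`, and the toy threshold stump are hypothetical, and an exact tie at T/2 is resolved toward output 1.

```python
from typing import Callable, Dict, Tuple

# Hypothetical representation: hyps[(i, t)] is the 0/1-valued function h_{i,t}
# labeling node v_{i,t}, for layers t = 1..T and positions i = 0..t-1.
Hypothesis = Callable[[float], int]


def route(hyps: Dict[Tuple[int, int], Hypothesis], x: float, T: int) -> int:
    """Return ell, the index of the node v_{ell,T+1} that x reaches."""
    i = 0  # every example starts at node v_{0,1}
    for t in range(1, T + 1):
        # Edge label h_{i,t}(x): 0 keeps the index (left), 1 increments it (right).
        i += hyps[(i, t)](x)
    return i


def classify(hyps: Dict[Tuple[int, int], Hypothesis], x: float, T: int) -> int:
    """Final classifier: output 1 if at least T/2 hypotheses evaluated to 1."""
    return 1 if route(hyps, x, T) >= T / 2 else 0


# Toy usage: every node is labeled with the same threshold stump (an assumption
# made only to keep the example short).
T = 5
stump = lambda x: 1 if x > 0.5 else 0
hyps = {(i, t): stump for t in range(1, T + 1) for i in range(t)}
```

Because an example only ever moves from index i to i or i + 1, the index reached in the last layer equals the count of hypotheses that evaluated to 1 along its path.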

It will be noted that the martingale boosting algorithm described with
reference to FIG. 1 invokes the weak learner $t$ separate times in stage $t$, once for each
of the $t$ distinct distributions $D_{0,t}, \ldots, D_{t-1,t}$ corresponding to the $t$ nodes
$v_{0,t}, \ldots, v_{t-1,t}$ in layer $t$. The hypothesis $h_{i,t}$ is not obtained merely by running the
weak learner on $D_{i,t}$ and taking the resulting hypothesis to be $h_{i,t}$; rather, a total of
$T(T + 1)/2$ weak hypotheses are constructed across the $T$ stages. Any single example $x$
encounters only $T$ of these hypotheses in its path through the branching program.

The martingale boosting algorithms are designed to combine predictors
for sorting objects into classes, each of which is weak on its own, but
which may be combined to form a strong aggregate predictor. The algorithms may
be modified to combine continuous scores or figures of merit instead of combining
discrete or binary (e.g., yes or no) predictions.

The martingale boosting algorithms of the present invention can be
used for boosting a two-sided weak learner. For example, $c : X \to \{0,1\}$ may be a
target function to be learned with high accuracy with respect to the distribution $D$ over
$X$. In this example, the distributions $D^+$ and $D^-$ are defined with respect to $c$. By
definition, a hypothesis $h : X \to \{0,1\}$ is said to have a two-sided advantage $\gamma$ with
respect to $D$ if it satisfies both:

$$\Pr_{x \in D^+}[h(x) = 1] \ge 1/2 + \gamma, \tag{2a}$$

and

$$\Pr_{x \in D^-}[h(x) = 0] \ge 1/2 + \gamma. \tag{2b}$$

Such a hypothesis performs noticeably better than random guessing both on positive
examples and on negative examples. A two-sided weak learner, when invoked on
target concept $c$ and distribution $D$, outputs a hypothesis with a two-sided advantage
$\gamma$. The analysis of a standard weak learner may be reduced to the case of the
two-sided model.
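
As an illustration of conditions (2a) and (2b), the following Python sketch estimates the two-sided advantage of a hypothesis on a finite labeled sample. It assumes that D is the uniform distribution over the sample; the function name `two_sided_advantage`, the sample values, and the stump are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def two_sided_advantage(h: Callable[[float], int],
                        examples: Sequence[Tuple[float, int]]) -> float:
    """Largest gamma such that h is correct with rate >= 1/2 + gamma on the
    positive examples and also on the negative examples of the sample."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    acc_pos = sum(h(x) == 1 for x in pos) / len(pos)  # estimate of Pr_{D+}[h(x) = 1]
    acc_neg = sum(h(x) == 0 for x in neg) / len(neg)  # estimate of Pr_{D-}[h(x) = 0]
    return min(acc_pos, acc_neg) - 0.5


# Four positives and four negatives; the stump errs on one example of each class.
sample = [(0.9, 1), (0.8, 1), (0.7, 1), (0.2, 1),
          (0.1, 0), (0.3, 0), (0.4, 0), (0.6, 0)]
stump = lambda x: 1 if x > 0.5 else 0
gamma = two_sided_advantage(stump, sample)
```

Taking the minimum of the two per-class accuracies reflects the requirement that both inequalities hold with the same value of the advantage.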

The general boosting framework described above with reference to
FIG. 1 can be used to boost a two-sided weak learner to high accuracy. In a two-sided
boosting scheme ("Basic MartiBoost"), in learning stage $t$ at each node $v_{i,t}$ the
two-sided weak learner is run on examples drawn from $D_{i,t}$, which is the distribution
obtained by filtering $D$ to accept only those examples that reach node $v_{i,t}$. The
resulting hypothesis, which has a two-sided advantage $\gamma$ with respect to $D_{i,t}$, is then
used as the hypothesis function $h_{i,t}$ labeling node $v_{i,t}$.

In the Basic MartiBoost scheme, let $h$ denote the final branching
program that is constructed by the algorithm. A random example $x$ drawn from $D^+$
(i.e., a random positive example) is routed through $h$ according to a random walk that
is biased toward the right, and a random example $x$ drawn from $D^-$ is routed through $h$
according to a random walk that is biased toward the left. Example $x$ is classified by
$h$ according to whether $x$ reaches a final node $v_{\ell,T+1}$ with $\ell \ge T/2$ or $\ell < T/2$. This
classification implies that $h$ has high accuracy on both random positive examples and
random negative examples, for the following reason. For any node $v_{i,t}$,
conditioned on a positive example $x$ reaching node $v_{i,t}$, $x$ is distributed according to
$(D_{i,t})^+$. Consequently, by the definition of two-sided advantage, $x$ goes from node $v_{i,t}$
to node $v_{i+1,t+1}$ with a probability of at least $1/2 + \gamma$ (i.e., $x$ follows a random walk
biased to the right). Similarly, for any node $v_{i,t}$, a random negative example $x$ that
reaches node $v_{i,t}$ will proceed to node $v_{i,t+1}$ with a probability of at least $1/2 + \gamma$. Thus
random negative examples follow a random walk biased to the left.

The standard bounds on random walks imply that if $T = O(\log(1/\epsilon) / \gamma^2)$,
then the probability that a random positive example $x$ ends up at a node $v_{\ell,T+1}$ with
$\ell < T/2$ is at most $\epsilon$. The analogous bound holds for random negative examples, and
thus $h$ has an overall accuracy of at least $1-\epsilon$ with respect to $D$. Theorem 1 below
holds for the two-sided Basic MartiBoost algorithm.

Theorem 1. Let $\gamma_1, \gamma_2, \ldots, \gamma_T$ be any sequence of values between 0 and
1/2. For each value $t = 1, \ldots, T$, suppose that each of the $t$ invocations of the weak
learner on distributions $D_{i,t}$ with $0 \le i \le t-1$ yields a hypothesis $h_{i,t}$, which has a
two-sided advantage $\gamma_t$ with respect to $D_{i,t}$. In these conditions, the final output
hypothesis $h$ that the Basic MartiBoost algorithm computes will satisfy:

$$\Pr_{x \in D^+}[h(x) \ne c(x)] \le \exp\!\left(-\Big(\sum_{t=1}^{T} \gamma_t\Big)^{2} \Big/ (2T)\right). \tag{3}$$
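
The bound of Theorem 1 can be compared numerically with an exact calculation in a simplified setting. The Python sketch below assumes, purely for illustration, that a positive example steps right independently at every layer with probability exactly 1/2 + γ (the theorem itself only requires an advantage of at least γ_t at each node), and it takes the right-hand side of inequality (3) to be exp(-(Σγ_t)²/(2T)). The function names are hypothetical.

```python
import math


def misroute_probability(T: int, gamma: float) -> float:
    """Exact probability that a walk of T independent steps, each to the right
    with probability 1/2 + gamma, ends with fewer than T/2 right steps."""
    p = 0.5 + gamma
    return sum(math.comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T + 1) if k < T / 2)


def theorem1_bound(gammas) -> float:
    """Right-hand side of inequality (3): exp(-(sum of gamma_t)^2 / (2T))."""
    T = len(gammas)
    return math.exp(-sum(gammas) ** 2 / (2 * T))


T, gamma = 101, 0.1
exact = misroute_probability(T, gamma)   # binomial tail below T/2
bound = theorem1_bound([gamma] * T)      # exp(-T * gamma^2 / 2) in this case
```

In this special case the exact misclassification probability is a binomial tail, so it can be checked directly that it does not exceed the stated bound.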

For brevity, formal mathematical proofs of Theorem 1 and other 
related Theorems 2-6 discussed herein are not included herein. However, formal
mathematical proofs of the theorems, properties, and features of the inventive 
martingale boosting algorithms can be found in P. Long and R. Servedio, "Martingale 
Boosting," Eighteenth Annual Conference on Computational Learning Theory 
(COLT), 2005, pp. 79-94, which is incorporated by reference herein in its entirety. 

The usual assumption made in boosting data analysis is the availability 
of access to a standard weak learning algorithm, which when invoked on target 
concept c and distribution D outputs a hypothesis h that has an advantage with respect 
to $D$. By definition, a hypothesis $h : X \to \{0,1\}$ is said to have advantage $\gamma$ with
respect to $D$ if it satisfies:

$$\Pr_{x \in D}[h(x) = c(x)] \ge 1/2 + \gamma. \tag{4}$$

This assumption is less demanding than the two-sided weak learner considered above. 
However, the Basic MartiBoost algorithm for the two-sided weak learner can be 
modified to boost a standard weak learner to high accuracy. 

The modified algorithm ("MartiBoost") to boost a weak learner works
as follows: In stage $t$, at each node $v_{i,t}$, the weak learning algorithm is run on $\hat{D}_{i,t}$,
which is a balanced version of the distribution $D_{i,t}$ (i.e., which puts equal weight on
positive and negative examples). If $g_{i,t}$ denotes the hypothesis that the weak learner
returns, the hypothesis $h_{i,t}$ that is used to label $v_{i,t}$ is given by $\hat{g}_{i,t}$, namely $g_{i,t}$ balanced
with respect to the balanced distribution $\hat{D}_{i,t}$.
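
The notion of a balanced version of a distribution can be sketched for a finite, explicitly weighted sample as follows. This Python fragment is an illustrative assumption about one way to carry out the rebalancing (rescaling each class to total weight 1/2); the function name `balance` and the example weights are hypothetical, and the sketch does not cover the construction of the balanced hypothesis.

```python
from typing import List, Sequence


def balance(weights: Sequence[float], labels: Sequence[int]) -> List[float]:
    """Rescale a finite distribution so that positives and negatives each carry
    total weight 1/2, preserving relative weights within each class."""
    w_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    w_neg = sum(w for w, y in zip(weights, labels) if y == 0)
    if w_pos == 0 or w_neg == 0:
        raise ValueError("both classes must have positive weight")
    return [w / (2 * w_pos) if y == 1 else w / (2 * w_neg)
            for w, y in zip(weights, labels)]


# A distribution with 20% of its mass on positives and 80% on negatives.
weights = [0.1, 0.1, 0.2, 0.6]
labels = [1, 1, 0, 0]
balanced = balance(weights, labels)
```

After rebalancing, a hypothesis with advantage γ cannot obtain that advantage merely by predicting the more frequent label.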

Theorem 2 below holds for the MartiBoost algorithm.

Theorem 2. Let $\gamma_1, \gamma_2, \ldots, \gamma_T$ be any sequence of values between 0
and 1/2. For each value $t = 1, \ldots, T$, suppose that each of the $t$ invocations of the weak
learner on balanced distributions $\hat{D}_{i,t}$, with $0 \le i \le t-1$, yields a hypothesis $g_{i,t}$, which
has advantage $\gamma_t$ with respect to the balanced $\hat{D}_{i,t}$. In these conditions, the final
branching program hypothesis $h$ that MartiBoost constructs will satisfy:

$$\Pr_{x \in D}[h(x) \ne c(x)] \le \exp\!\left(-\Big(\sum_{t=1}^{T} \gamma_t\Big)^{2} \Big/ (8T)\right). \tag{5}$$

In an exemplary embodiment, the MartiBoost algorithm is run on a 
fixed sample. In this case all relevant probabilities can be maintained explicitly in a 
look-up table, and then Theorem 2 bounds the training set accuracy of the MartiBoost. 
In another exemplary embodiment, the MartiBoost algorithm is given access to an
example oracle $EX(c, D)$. In this version of the algorithm, for efficiency the
execution of the algorithm may be frozen at nodes $v_{i,t}$ where it is too expensive to
simulate the balanced distributions $\hat{D}_{i,t}$.

Weak learning in the example oracle $EX(c, D)$ framework may be
defined as follows: Given a target function $c : X \to \{0,1\}$, an algorithm A is said to be
a weak learner if it satisfies the following property: for any $\delta > 0$ and any distribution
$D$ over $X$, if A is given $\delta$ and access to $EX(c, D)$, then algorithm A outputs a
hypothesis $h : X \to \{0,1\}$, which with a probability of at least $1 - \delta$ satisfies:

$$\Pr_{x \in D}[h(x) = c(x)] \ge 1/2 + \gamma. \tag{6}$$

By definition, $m_A(\delta)$ is the running time of algorithm A, where one
time step is charged for each invocation of the oracle $EX(c, D)$. In instances where
algorithm A is run using a simulated oracle $EX(c, D')$, but with access only to oracle
$EX(c, D)$, the running time will be at most $m_A(\delta)$ times the amount of time it takes to
simulate a draw from $EX(c, D')$ given $EX(c, D)$.

An idealized version of the oracle algorithm ("Sampling MartiBoost",
or "SMartiBoost") is designed to work with random examples assuming that all
required probabilities can be computed exactly. For convenience, let $r$ denote all of
the random bits used by all the hypotheses $h_{i,t}$. It may be convenient to think of $r$ as
an infinite sequence of random bits that is determined before the algorithm starts and
then read off one at a time as needed by the algorithm. In stage $t$ of SMartiBoost, all
nodes at levels $t' < t$ have been labeled and the algorithm is labeling the $t$ nodes
$v_{0,t}, \ldots, v_{t-1,t}$ in layer $t$. In the following, the probability
$\Pr_{x \in D, r}[x \text{ reaches } v_{i,t}]$ is denoted by $p_{i,t}$. Further, for each $b \in \{0,1\}$, the probability
$\Pr_{x \in D, r}[x \text{ reaches } v_{i,t} \text{ and the label of } x \text{ is } b]$ is denoted by $p^b_{i,t}$, so that
$p_{i,t} = p^0_{i,t} + p^1_{i,t}$.

In stage $t$, for each node $v_{i,t}$ the SMartiBoost algorithm does the
following operations:

1. If $\min_{b \in \{0,1\}} p^b_{i,t} < \epsilon / (T(T+1))$, then the SMartiBoost algorithm
"freezes" node $v_{i,t}$ by labeling it with the bit $(1 - b)$, where $b$ is the minimizing bit,
and making it a terminal node with no outgoing edges, so that any example $x$ which
reaches $v_{i,t}$ will be assigned label $(1 - b)$ by the branching program hypothesis.

2. In the converse case, $\min_{b \in \{0,1\}} p^b_{i,t} \ge \epsilon / (T(T+1))$, the SMartiBoost
algorithm works just like the MartiBoost algorithm in that it runs the weak learning
algorithm on the balanced version $\hat{D}_{i,t}$ to obtain a hypothesis $g_{i,t}$. The algorithm labels
$v_{i,t}$ with $h_{i,t} = \hat{g}_{i,t}$, which is $g_{i,t}$ balanced with respect to $\hat{D}_{i,t}$.

Each node $v_{i,t}$ which is frozen in operation (1) above contributes at
most $\epsilon / (T(T+1))$ to the error of the final branching program hypothesis. The total error
induced by all frozen nodes is at most $\epsilon / 2$, since there are at most $T(T + 1)/2$ nodes in
the branching program. Conversely, in the case $\min_{b \in \{0,1\}} p^b_{i,t} \ge \epsilon / (T(T+1))$ for any
node $v_{i,t}$ which is not frozen, the expected number of draws from $EX(c, D)$ that are
required to simulate a draw from $EX(c, \hat{D}_{i,t})$ is $O(T^2 / \epsilon)$. Thus, the weak learner can be
run efficiently on the desired distributions.
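
The freezing rule and the resulting error budget can be sketched as follows. The Python fragment assumes that the freezing threshold reads ε/(T(T+1)) and that the two probabilities for a node are supplied exactly; the function names `freeze_label` and `max_frozen_error` are hypothetical.

```python
from typing import Optional


def freeze_label(p0: float, p1: float, eps: float, T: int) -> Optional[int]:
    """Return the terminal label for node v_{i,t}, or None if it is not frozen.

    p0 and p1 stand for p^0_{i,t} and p^1_{i,t}: the probabilities that a random
    example reaches the node and has label 0 or 1, respectively.
    """
    threshold = eps / (T * (T + 1))
    b = 0 if p0 <= p1 else 1          # label carrying the smaller probability mass
    if min(p0, p1) < threshold:
        return 1 - b                  # frozen: predict the other label
    return None                       # not frozen: run the weak learner instead


def max_frozen_error(eps: float, T: int) -> float:
    """Worst case: all T(T+1)/2 nodes frozen, each costing at most eps/(T(T+1))."""
    return (T * (T + 1) / 2) * (eps / (T * (T + 1)))
```

For example, with ε = 0.1 and T = 10 the threshold is 0.1/110, and the worst-case contribution of frozen nodes evaluates to ε/2 = 0.05.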

Theorem 3 below establishes the correctness of the SMartiBoost
algorithm when all required probabilities are known exactly.

Theorem 3. Let $T = 8 \ln(2/\epsilon) / \gamma^2$. Suppose that each time the
weak learner is invoked by the SMartiBoost algorithm on some balanced distribution
$\hat{D}_{i,t}$, it outputs a hypothesis that has an advantage $\gamma$ with respect to $\hat{D}_{i,t}$. Then, the
final branching program hypothesis $h$ that SMartiBoost constructs will satisfy:

$$\Pr_{x \in D}[h(x) \ne c(x)] \le \epsilon. \tag{7}$$
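
As a plausibility check only (the formal proof is given in the reference cited above), the choice of T in Theorem 3 can be related to the bound of Theorem 2 and to the error budget for frozen nodes. The sketch assumes that every invocation achieves the same advantage γ, that inequality (5) reads exp(-(Σγ_t)²/(8T)), and that each frozen node contributes at most ε/(T(T+1)):

```latex
\begin{align*}
\exp\!\left(-\frac{(T\gamma)^2}{8T}\right)
  &= \exp\!\left(-\frac{T\gamma^2}{8}\right)
   = \exp\!\left(-\ln\frac{2}{\epsilon}\right)
   = \frac{\epsilon}{2}
   && \text{with } T = \frac{8\ln(2/\epsilon)}{\gamma^2},\\[4pt]
\underbrace{\frac{\epsilon}{2}}_{\text{unfrozen paths}}
  + \underbrace{\frac{T(T+1)}{2}\cdot\frac{\epsilon}{T(T+1)}}_{\text{frozen nodes}}
  &= \epsilon .
\end{align*}
```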
In the case where all required probabilities are not known exactly (i.e.,
in the presence of sampling error), Theorem 4 below establishes the correctness of the 
SMartiBoost algorithm. 

Theorem 4. Let $T = \Theta(\log(1/\epsilon) / \gamma^2)$, and let the notation $\tilde{O}$ hide
polylogarithmic factors for the sake of readability. If A is a weak learning algorithm
that requires $s_A$ many examples to construct a $\gamma$-advantage hypothesis, then
SMartiBoost makes $O(s_A) \cdot \tilde{O}(1/\epsilon) \cdot \mathrm{poly}(1/\gamma)$ many calls to $EX(c, D)$ and with a
probability of $(1 - \delta)$ outputs a final hypothesis $h$ that satisfies:

$$\Pr_{x \in D}[h(x) \ne c(x)] \le \epsilon. \tag{8}$$

The SMartiBoost algorithm can be further modified to withstand
random classification noise. Given a distribution $D$ and a value $0 < \eta < 1/2$, a noisy
example oracle is an oracle $EX(c, D, \eta)$ which is defined as follows: each time
$EX(c, D, \eta)$ is invoked, it returns a labeled example $(x, b) \in X \times \{0, 1\}$, where $x \in X$ is
drawn from the distribution $D$, and $b$ is chosen to be $c(x)$ with a probability of $(1 - \eta)$
and chosen to be $(1 - c(x))$ with a probability of $\eta$.
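
A noisy example oracle of this kind can be simulated as in the following Python sketch. The factory name `make_noisy_oracle`, the `draw_x` callback standing in for the distribution D, and the seeding scheme are assumptions introduced for illustration.

```python
import random
from typing import Callable, Tuple


def make_noisy_oracle(c: Callable[[float], int],
                      draw_x: Callable[[random.Random], float],
                      eta: float,
                      seed: int = 0) -> Callable[[], Tuple[float, int]]:
    """Build a sampler playing the role of EX(c, D, eta)."""
    if not 0 < eta < 0.5:
        raise ValueError("noise rate eta must satisfy 0 < eta < 1/2")
    rng = random.Random(seed)

    def oracle() -> Tuple[float, int]:
        x = draw_x(rng)            # x drawn from D
        b = c(x)                   # correct label c(x)
        if rng.random() < eta:     # flip the label with probability eta
            b = 1 - b
        return x, b

    return oracle


# Usage: uniform D on [0, 1), threshold target concept, 20% label noise.
target = lambda x: 1 if x > 0.5 else 0
ex = make_noisy_oracle(target, lambda rng: rng.random(), eta=0.2, seed=1)
```

Each call to `ex()` returns one labeled example; over many calls the fraction of labels that disagree with the target concept approaches η.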

It is useful here to recount the definition of weak learning in the noisy setting.
Noise-tolerant weak learning may be defined as follows: Given a target function
$c : X \to \{0,1\}$, an algorithm A is said to be a noise-tolerant weak learning algorithm
with an advantage $\gamma$ if it satisfies the following property: for any $\delta > 0$ and any
distribution $D$ over $X$, if A is given $\delta$ and access to a noisy example oracle $EX(c, D, \eta)$
where $0 < \eta < 1/2$, then A runs in time $\mathrm{poly}(1/(1-2\eta), 1/\delta)$ and, with a probability of
at least $(1 - \delta)$, A outputs a hypothesis $h$ that satisfies:

$$\Pr_{x \in D}[h(x) = c(x)] \ge 1/2 + \gamma. \tag{9}$$

In general for boosting algorithms, it is mathematically impossible to
achieve an arbitrarily low error rate $\epsilon$ below the noise rate $\eta$. However, the
noise-tolerant variant of the SMartiBoost algorithm, like the known modified Mansour and
McAllester boosting algorithm, can achieve an error rate $\epsilon = \eta + \tau$ in time polynomial
in $1/\tau$ and the other relevant parameters. (See, e.g., Mansour and McAllester,
"Boosting Using Branching Programs," Journal of Computer and System Sciences,
64(1), pp. 103-112, 2002.)
A reason why SMartiBoost can be easily modified to withstand
random classification noise is that in each stage $t$ of boosting, the label $b$ of a
labeled example $(x, b)$ plays only a limited role in the reweighting that the example
experiences. Since this role is limited, it is possible to efficiently simulate the
distributions that the weak learner requires at each stage of boosting and thus for the
overall boosting process to succeed.

For example, as a labeled example $(x, b)$ proceeds through levels $1, \ldots, t-1$
of the branching program in stage $t$, the path it takes is completely independent
of $b$. Thus, given a source $EX(c, D, \eta)$ of noisy examples, the distribution of examples
that arrive at a particular node $v_{i,t}$ is precisely $EX(c, D_{i,t}, \eta)$. However, once a labeled
example $(x, b)$ arrives at some node $v_{i,t}$, label $b$ must be consulted in the "rebalancing"
of the distribution $D_{i,t}$ to obtain distribution $\hat{D}_{i,t}$. More precisely, the labeled examples
that reach node $v_{i,t}$ are distributed according to $EX(c, D_{i,t}, \eta)$, but to use SMartiBoost
with a noise-tolerant weak learner requires simulation of the balanced distribution $\hat{D}_{i,t}$
corrupted with random classification noise, i.e., $EX(c, \hat{D}_{i,t}, \eta')$. It is not necessary that
the noise rate $\eta'$ in the balanced case be the same as $\eta$. The SMartiBoost algorithm
will work as long as the noise rate $\eta'$ is not too close to 1/2.

Simulation of the balanced distribution $\hat{D}_{i,t}$ corrupted with random
classification noise, $EX(c, \hat{D}_{i,t}, \eta')$, can take place according to the following Rejection
Sampling Procedure Lemma, which is similar to that described in A. Kalai and R.
Servedio, "Boosting In The Presence Of Noise," Proc. 35th Annual Symposium on
Theory of Computing (STOC), pages 196-205, 2003.

Rejection Sampling Procedure Lemma: Let $\tau > 0$ be any value
satisfying $\eta + \tau/2 < 1/2$. Suppose we have access to $EX(c, D, \eta)$. Let $\rho$ denote
$\Pr_{x \in D}[c(x) = 1]$. Further, suppose that $\eta + \tau/2 \le \rho \le 1/2$. Given a draw $(x, b)$ from
$EX(c, D, \eta)$:

1. If $b = 0$, then with a probability of $p_r = (1-2\rho)/(1-\rho-\eta)$ reject $(x, b)$,
and with a probability of $1 - p_r = (\rho-\eta)/(1-\rho-\eta)$ set $b' = b$ and accept $(x, b')$;

2. If $b = 1$, then set $b' = 1 - b$ with a probability of

$$p_f = \frac{(1-2\rho)\,\eta\,(1-\eta)}{(1-\rho-\eta)(\rho+\eta-2\rho\eta)},$$

set $b' = b$ with a probability of $1 - p_f$, and accept $(x, b')$.
Given a draw from $EX(c, D, \eta)$, the foregoing procedure rejects with a
probability:

$$p_{rej} = \frac{(1-2\rho)\,(\rho\eta + (1-\rho)(1-\eta))}{1-\rho-\eta} \tag{10a}$$

and accepts with a probability:

$$1 - p_{rej} = \frac{2(1-2\eta)(1-\rho)\rho}{1-\rho-\eta}. \tag{10b}$$

Moreover, if the procedure accepts, then the $(x, b')$ that it accepts is distributed
according to $EX(c, \hat{D}, \eta')$, where $\hat{D}$ is the balanced version of $D$ and
$\eta' = 1/2 - (\rho-\eta)/(2(\rho+\eta-2\rho\eta))$.
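
The quantities in the lemma can be checked by exact enumeration of the four combinations of true label and noisy label. The Python sketch below assumes that $p_r = (1-2\rho)/(1-\rho-\eta)$ and $p_f = (1-2\rho)\eta(1-\eta)/((1-\rho-\eta)(\rho+\eta-2\rho\eta))$, with $\rho = \Pr[c(x) = 1]$; the function names are hypothetical.

```python
def rejection_parameters(rho: float, eta: float):
    """Return (p_r, p_f) for rho = Pr[c(x) = 1] and noise rate eta."""
    p_r = (1 - 2 * rho) / (1 - rho - eta)
    p_f = ((1 - 2 * rho) * eta * (1 - eta)
           / ((1 - rho - eta) * (rho + eta - 2 * rho * eta)))
    return p_r, p_f


def accepted_distribution(rho: float, eta: float):
    """Enumerate the four (true label, noisy label) cases exactly and return
    (acceptance probability, Pr[c(x) = 1 | accepted],
     noise rate on positives, noise rate on negatives)."""
    p_r, p_f = rejection_parameters(rho, eta)
    # Mass of accepted examples, keyed by (true label c, output label b').
    mass = {(c, b): 0.0 for c in (0, 1) for b in (0, 1)}
    for c, p_c in ((1, rho), (0, 1 - rho)):
        for b, p_b in ((c, 1 - eta), (1 - c, eta)):
            joint = p_c * p_b
            if b == 0:
                mass[(c, 0)] += joint * (1 - p_r)   # kept with label 0
            else:
                mass[(c, 0)] += joint * p_f         # label 1 flipped to 0
                mass[(c, 1)] += joint * (1 - p_f)   # label 1 kept
    accept = sum(mass.values())
    pos = mass[(1, 0)] + mass[(1, 1)]
    neg = mass[(0, 0)] + mass[(0, 1)]
    return accept, pos / accept, mass[(1, 0)] / pos, mass[(0, 1)] / neg
```

For instance, with ρ = 0.3 and η = 0.1 the enumeration gives an acceptance probability of 0.56, equal weight on positive and negative examples, and the same label noise rate on both classes, in agreement with the closed-form expressions for the acceptance probability and for η'.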

The operation of the noise-tolerant SMartiBoost is described in the
following: As previously, $p_{i,t}$ denotes the probability $\Pr_{x \in D, r}[x \text{ reaches } v_{i,t}]$.
Further, $q^b_{i,t}$ denotes the probability $\Pr_{x \in D, r}[c(x) = b \mid x \text{ reaches } v_{i,t}] =
\Pr_{x \in D_{i,t}, r}[c(x) = b]$, so that $q^0_{i,t} + q^1_{i,t} = 1$. The noise-tolerant SMartiBoost takes as
input a parameter $\tau$, where $\eta + \tau$ is the desired final error rate. Without loss of
generality, it may be assumed that $\eta + \tau < 1/2$.

In stage $t$, the noise-tolerant SMartiBoost algorithm does the following
operations for each node $v_{i,t}$:

1. If $p_{i,t} < 2\tau / (3T(T+1))$, then the algorithm "freezes" node $v_{i,t}$ by
labeling it with an arbitrary bit and making it a terminal node with no outgoing edges.

2. If $\min_{b \in \{0,1\}} q^b_{i,t} < \eta + \tau/3$, then the algorithm "freezes" node $v_{i,t}$
by making it a terminal node labeled $(1 - b)$ with no outgoing edges.

3. Otherwise, the algorithm runs the weak learning algorithm using
$EX(c, \hat{D}_{i,t}, \eta')$ as described in the Rejection Sampling Procedure Lemma to obtain a
hypothesis $g_{i,t}$. The algorithm labels $v_{i,t}$ with $h_{i,t} = \hat{g}_{i,t}$, which is $g_{i,t}$ balanced with
respect to $\hat{D}_{i,t}$.
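
The three-way decision made at each node can be sketched as follows. The Python fragment assumes that the thresholds read 2τ/(3T(T+1)) for operation 1 and η + τ/3 for operation 2, and that the probabilities are known exactly; the function name and the string return values are hypothetical.

```python
def noise_tolerant_action(p: float, q0: float, q1: float,
                          eta: float, tau: float, T: int) -> str:
    """Decide what to do at node v_{i,t}.

    p stands for p_{i,t}; q0 and q1 stand for q^0_{i,t} and q^1_{i,t}.
    Returns 'freeze-arbitrary', 'freeze-0', 'freeze-1', or 'learn'.
    """
    if p < 2 * tau / (3 * T * (T + 1)):
        return "freeze-arbitrary"      # operation 1: node is rarely reached
    b = 0 if q0 <= q1 else 1           # less likely true label at this node
    if min(q0, q1) < eta + tau / 3:
        return f"freeze-{1 - b}"       # operation 2: predict the other label
    return "learn"                     # operation 3: run the weak learner
```

Only nodes for which the function returns "learn" require simulation of the balanced noisy oracle by rejection sampling.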

Theorem 5 below establishes the correctness of the noise-tolerant
SMartiBoost algorithm when all required probabilities are known exactly. 

Theorem 5. Let $T = 8 \ln(2/\epsilon) / \gamma^2$. Suppose that each time the weak
learner is invoked with some oracle $EX(c, \hat{D}_{i,t}, \eta')$, it outputs a
hypothesis $g_{i,t}$ with $\Pr_{x \in \hat{D}_{i,t}}[g_{i,t}(x) = c(x)] \ge 1/2 + \gamma$. Then the final branching
program hypothesis $h$ that the noise-tolerant SMartiBoost constructs will satisfy:

$$\Pr_{x \in D}[h(x) \ne c(x)] \le \eta + \tau. \tag{11}$$



In the case where all required probabilities are not known exactly,
sufficiently accurate estimates of the probabilities can be obtained via a polynomial
amount of sampling. Theorem 6 below establishes the correctness of the
noise-tolerant SMartiBoost algorithm in such case.

Theorem 6. Given any $\tau$ such that $\eta + \tau < 1/2$, let $T = \Theta(\log(1/\epsilon) / \gamma^2)$.
If A is a noise-tolerant weak learning algorithm with an advantage $\gamma$, then the
noise-tolerant SMartiBoost makes $\mathrm{poly}(1/\gamma, 1/\tau, 1/\delta)$ many calls to $EX(c, D, \eta)$ and
with a probability of $(1 - \delta)$ outputs a final hypothesis $h$ that satisfies:

$$\Pr_{x \in D}[h(x) \ne c(x)] \le \eta + \tau. \tag{12}$$

Because of their simplicity and attractive theoretical properties, the 
inventive martingale boosting algorithms may advantageously be used in practical 
machine learning applications. A practical algorithm may involve repeatedly dividing 
the training data into bins, as opposed to using fresh examples during each stage as
discussed above, for example, with respect to FIG. 1 and Theorem 1.
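
One way to picture the binning of a fixed training set is sketched below. The sketch assumes that the bin of an example at each stage is indexed by the number of positive (1) predictions it has received in earlier stages, in the manner of the branching program of FIG. 1, where a 0 prediction keeps the index and a 1 prediction increments it. The function names `route` and `partition_into_bins`, and the representation of each stage as a dictionary from bin index to classifier, are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence

# A base classifier maps an example to 0 or 1.
Classifier = Callable[[Hashable], int]


def route(example: Hashable, stages: Sequence[Dict[int, Classifier]]) -> int:
    """Walk one example through the stages and return its bin index.

    The index equals the number of 1-predictions made so far by the
    classifiers the example has visited.
    """
    index = 0
    for stage in stages:
        classifier = stage[index]       # classifier attached to the current bin
        index += classifier(example)    # 0 keeps the index, 1 moves one bin to the right
    return index


def partition_into_bins(examples: Sequence[Hashable],
                        stages: Sequence[Dict[int, Classifier]]) -> Dict[int, List[Hashable]]:
    """Group the training examples by the bin they reach after the given stages."""
    bins: Dict[int, List[Hashable]] = defaultdict(list)
    for example in examples:
        bins[route(example, stages)].append(example)
    return dict(bins)
```

In a training loop built on this sketch, the examples in each bin would be used to choose the base classifier for that bin at the next stage, after which the enlarged list of stages is used to re-partition the same training data.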

In an exemplary application, a machine learning system based on a 
martingale ranking algorithm is utilized for feeder failure prediction in a commercial 
electrical power distribution system. 

In the commercial power distribution system, power generated at
remote power plants is delivered to residential, business, or industrial customers via a
transmission network or grid. Power is first transmitted as high voltage transmissions
from the remote power plants to geographically diverse substations. At the
substations, the received high voltage power is sent over "feeders" to transformers
that have low voltage outputs. The outputs of the transformers are connected to a
local low voltage power distribution grid that can be tapped directly by the customers.

In metropolitan areas (e.g., Manhattan) the feeders run under city
streets, and are spliced together in manholes. Multiple or redundant feeders may feed
the customer-tapped grid, so that individual feeders may fail without causing power
outages. However, multiple or collective feeder failures appear to be a potential
failure mode through which power outages could occur. Preventive maintenance of
the feeders is desirable. However, preventive maintenance schemes based on
maintenance of every feeder in the system are expensive, cumbersome, and
disruptive. Accordingly, power companies and utilities have developed empirical
models for evaluating the danger that a feeder could fail. These models provide
likelihood-of-failure scores, which may be used to prioritize repairs or maintenance.
However, in practice, the scores obtained by using the empirical models are a weak
guide and provide only a rough indication of likely failure events.

Machine learning systems and methods based on martingale boosting 
or ranking algorithms may be advantageously applied to improve feeder failure 
predictions. 

One such machine learning system utilizes an input database, which
includes a list of feeders, a list of scores for each feeder, and a historical record or
count of recent failures for each feeder. The list of scores may capture the strength of
evidence from a variety of sources or models that the particular feeder is error or
failure prone.

FIG. 2 shows exemplary learning process 200 in the machine learning
system for feeder failure predictions. At 210, the martingale boosting algorithm in the
machine learning system finds the score or variable that has the strongest association
with the past failure rate. For this purpose, the algorithm may be suitably coded, for
example, to maximize a popular measure called the "Area Under The ROC Curve."
Alternative measures may be used. At 220, the algorithm sorts the feeder list by the
score or variable that has the strongest association with past failure. Then at 230, the
algorithm divides the sorted list into two sublists so that past outages or failures are
apportioned equally or at least approximately equally between the two sublists. At
240, the algorithm determines the scores or variables that are best associated with the
failure rate in each of the sublists and accordingly sorts the feeders in each of the
sublists (250). At 260, the two sublists are combined together in one list. Next at
270, the combined list is divided into three sublists so that past outages or failures are
apportioned equally or at least approximately equally between the three sublists.

Training continues iteratively in the manner of 210-270. In the
iterations, the list of feeders is progressively divided into finer and finer sublists. The
algorithm determines the scores or variables that are best associated with the failure
rate in each of the sublists and accordingly sorts each of the sublists. The sorted
sublists are then recombined before the next finer iteration or division. After a
number of iterations of sublist divisions, re-sorting and recombinations, the particular
feeders that are predicted to be the most likely to fail are expected to rise to the top of
the recombined list. Thus, the feeders are ranked in order of their predicted likelihood
of failure. Maintenance schedules for the feeders may advantageously give priority to
the feeders at the top of the list.
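
The iterative ranking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system used in practice: each feeder is represented as a dictionary with hypothetical keys `"scores"` and `"failures"`, a higher score is assumed to indicate a higher risk of failure, the association between a score and past failures is measured by a simple pairwise Area Under the ROC Curve, and the function names are invented for this example.

```python
def auc(scores, failures):
    """Pairwise AUC: chance that a failed feeder outscores a non-failed one (ties count 0.5)."""
    pos = [s for s, f in zip(scores, failures) if f > 0]
    neg = [s for s, f in zip(scores, failures) if f == 0]
    if not pos or not neg:
        return 0.5  # AUC is undefined without both classes; treat as uninformative
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sort_by_best_score(feeders, score_names):
    """Pick the score with the highest AUC on this (sub)list and sort by it, riskiest first."""
    failures = [f["failures"] for f in feeders]
    best = max(score_names,
               key=lambda name: auc([f["scores"][name] for f in feeders], failures))
    return sorted(feeders, key=lambda f: f["scores"][best], reverse=True)


def split_equal_failures(feeders, k):
    """Cut an ordered list into k contiguous sublists with roughly equal past failures."""
    total = sum(f["failures"] for f in feeders)
    sublists, current, running = [], [], 0
    for f in feeders:
        current.append(f)
        running += f["failures"]
        # Close the current sublist once its cumulative share of failures is reached.
        if len(sublists) < k - 1 and running >= total * (len(sublists) + 1) / k:
            sublists.append(current)
            current = []
    sublists.append(current)
    return sublists


def martingale_rank(feeders, score_names, rounds):
    """Sort, then repeatedly split into 2, 3, ... sublists, re-sort each, and recombine."""
    ranked = sort_by_best_score(feeders, score_names)       # steps 210 and 220
    for k in range(2, rounds + 2):                          # steps 230 to 270, repeated
        parts = split_equal_failures(ranked, k)
        ranked = [f for part in parts for f in sort_by_best_score(part, score_names)]
    return ranked
```

Calling `martingale_rank(feeders, ["a", "b"], rounds=2)` would perform the initial sort followed by one division into two sublists and one into three. The `rounds` parameter is the knob that trades ranking refinement against processing time.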

In machine learning practice, the number of sublist divisions and
re-sorting steps may be suitably limited by considerations of processing time, cost, and
return.

In accordance with the present invention, software (i.e., instructions)
for implementing the aforementioned machine learning systems and methods
(algorithms) can be provided on computer-readable media. It will be appreciated that
each of the steps (described above in accordance with this invention), and any
combination of these steps, can be implemented by computer program instructions.
These computer program instructions can be loaded onto a computer or other
programmable apparatus to produce a machine such that the instructions, which
execute on the computer or other programmable apparatus, create means for
implementing the functions of the aforementioned machine learning systems and
methods. These computer program instructions can also be stored in a
computer-readable memory that can direct a computer or other programmable apparatus to
function in a particular manner such that the instructions stored in the
computer-readable memory produce an article of manufacture including instruction means,
which implement the functions of the aforementioned machine learning systems and
methods. The computer program instructions can also be loaded onto a computer or
other programmable apparatus to cause a series of operational steps to be performed
on the computer or other programmable apparatus to produce a
computer-implemented process such that the instructions which execute on the computer or
other programmable apparatus provide steps for implementing the functions of the
aforementioned machine learning systems and methods. It will also be understood
that the computer-readable media on which instructions for implementing the
aforementioned machine learning systems and methods are to be provided include,
without limitation, firmware, microcontrollers, microprocessors, integrated circuits,
ASICs, and other available media.

The foregoing merely illustrates the principles of the invention.
Various modifications and alterations to the described embodiments will be apparent
to those skilled in the art in view of the teachings herein, including by combining
different features from different disclosed embodiments. It will thus be appreciated
that those skilled in the art will be able to devise numerous techniques which,
although not explicitly described herein, embody the principles of the invention and
are thus within the spirit and scope of the invention.
Claims:

1. A machine learning method having a multiple number of learning stages, each
learning stage comprising:

partitioning examples into bins;

choosing a base classifier for each bin; and

assigning an example to a bin by counting the number of positive predictions
previously made by the base classifier associated with the bin.

2. The method of claim 1, wherein assigning an example to a bin comprises
classifying an example by a random walk on the number of base classifiers that are
positive predictions.

3. The method of claim 1, wherein assigning an example to a bin comprises
balancing error rates substantially equally between false positives and false negatives.

4. The method of claim 1, wherein assigning an example to a bin comprises
assigning a particular example to a particular bin independent of any label associated
with the example.

5. The method of claim 1, further comprising using decision stumps as base 
classifiers. 

6. A machine learning system for automated learning in stages, the system
comprising a boosting algorithm that is configured at each learning stage to:

partition training examples into bins;

choose a base classifier for each bin; and

assign an example to a bin by counting the number of positive predictions
previously made by the base classifier associated with the bin,

whereby at each learning stage the false positive and false negative error rates are
substantially balanced.

7. A computer readable medium for machine learning from a training data set,
the computer readable medium comprising a set of instructions for:

partitioning the training data set into bins;

choosing a base classifier for each bin; and

assigning a datum to a bin by counting the number of positive predictions
previously made by the base classifier associated with the bin.

8. A machine learning method for predicting the behavior of objects, wherein
past behaviors of the objects are known and wherein the objects are associated with a
plurality of scores that are predictive of object behavior, the method having a multiple
number of learning stages, each learning stage comprising:

partitioning a list of objects into a number of sublists so that past behaviors of
the objects are distributed substantially evenly across the number of sublists;

for each sublist, choosing a predictive score from the plurality of predictive
scores associated with the objects in the sublist;

for each sublist, ranking objects in the sublist according to the chosen
predictive score; and then

recombining the sublists to generate a list in which the objects are ranked
according to the predictive scores chosen for the respective sublists.

9. The method of claim 8, wherein choosing a predictive score for each sublist
comprises selecting the predictive score that most accurately predicts the past
behavior of the objects in the sublist.

10. The method of claim 8, wherein partitioning a list of objects into a number of
sublists comprises partitioning the list of objects into an increasing number of sublists
at each successive learning stage.

11. The method of claim 8, wherein the objects are feeders in an electrical power
distribution system, and wherein the past behaviors are feeder failure events.

12. A machine learning system for predicting the failure of feeders in an electrical
power distribution system, wherein past feeder failure events are known and wherein
the feeders are associated with a plurality of scores that are predictive of feeder
failure, the system comprising an algorithm configured to process a list of feeders and
the associated plurality of scores in a number of successive learning stages, each
learning stage comprising:

partitioning the list of feeders into a number of sublists so that the past feeder
failure events are distributed substantially evenly across the number of sublists;

for each sublist, choosing a predictive score from the plurality of predictive
scores associated with the objects in the sublist;

for each sublist, ranking feeders in the sublist according to the chosen
predictive score; and then

recombining the sublists to generate a list in which the feeders are ranked
according to the predictive scores chosen for the respective sublists.

13. The machine learning system of claim 12, wherein the algorithm is configured
to partition the list of feeders into an increasing number of sublists at each successive
learning stage.

14. The machine learning system of claim 12, wherein the algorithm is configured
to choose for a sublist the predictive score that most accurately predicts the past
feeder failure events for the feeders in the sublist.

Fig. 1. The branching program produced by the boosting algorithm. Each node
$v_{i,t}$ is labeled with a 0/1-valued function $h_{i,t}$; left edges correspond to 0 and
right edges to 1.

Fig. 2. Learning process 200.

210: Find the score having the strongest association with the past failure rate.
220: Sort the feeder list by the score most associated with past failure.
230: Divide the sorted list into two sublists.
240: Find the score having the strongest association with the past failure rate in each sublist.
250: Sort the feeders in each of the sublists by the score most associated with past failure.
260: Combine the two sublists together in one list.
270: Divide the combined list into three sublists.



